A review of last week’s contents
Fundamental Techniques in Data Science with R
library(mvtnorm)  # multivariate normal tools
library(dplyr)    # data manipulation
library(magrittr) # pipes, including %$%
library(ggplot2)  # plotting
library(mice)     # contains the boys data set
table(): build a contingency table out of counts
hist(): draw a histogram
mvtnorm::rmvnorm(): draw values from a multivariate normal distribution
sample(): probabilistic sampling with or without replacement
rchisq(): probabilistic sampling from a \(\chi^2\) distribution
rpois(): probabilistic sampling from a Poisson distribution
set.seed(123): fix the random number generator seed

There are four key assumptions behind the use of linear regression models.
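As a quick refresher, most of these functions can be combined in a few lines. A minimal sketch (the values drawn here are purely illustrative):

```r
set.seed(123)                              # fix the RNG for reproducibility
draws <- sample(1:3, size = 100, replace = TRUE)
table(draws)                               # contingency table of the counts
hist(rchisq(1000, df = 5))                 # histogram of chi-squared draws
```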
In short, we assume
\[ y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \epsilon\]
The residuals are normally distributed with mean \(\mu_\epsilon = 0\)
The observed data: \[y = \alpha + \beta X + \epsilon\]
The predicted values: \[\hat{y} = \alpha + \beta X\]
The components in linear modeling
As a data set, this would be the result for lm(y1 ~ x1, data = anscombe)
Here the residuals are not independent, and the residual variance is not constant!
Here the residuals do not have mean zero.
Here the residual variance is not constant, the residuals are not normally distributed and the relation between \(Y\) and \(X\) is not linear!
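Such violations are easy to produce by simulation. The following hypothetical example (not part of the anscombe data) generates data whose residual spread grows with the predictor, so the first diagnostic plot shows a clear fan shape:

```r
set.seed(123)
x <- runif(100, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(100, sd = x / 2)  # error variance increases with x
fit_het <- lm(y ~ x)
plot(fit_het, which = 1)                   # residuals vs. fitted: fanning out
```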
anscombe example
fit <- anscombe %$% lm(y1 ~ x1)
fit %>% summary()
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.92127 -0.45577 -0.04136  0.70941  1.83882
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.0001     1.1247   2.667  0.02573 *
## x1            0.5001     0.1179   4.241  0.00217 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.237 on 9 degrees of freedom
## Multiple R-squared:  0.6665, Adjusted R-squared:  0.6295
## F-statistic: 17.99 on 1 and 9 DF,  p-value: 0.00217
qqnorm() with simulated errors, cf. rnorm(n, 0, s)

anscombe %>%
  ggplot(aes(x1, y1)) +
  geom_point() +
  geom_smooth(method = "loess", col = "blue") +
  geom_smooth(method = "lm", col = "orange")
The loess curve suggests slight non-linearity
Does adding a squared term improve fit?
anscombe %$% lm(y1 ~ x1 + I(x1^2)) %>% summary()
##
## Call:
## lm(formula = y1 ~ x1 + I(x1^2))
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.8704 -0.3481 -0.2456  0.7129  1.8072
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  0.75507    3.28815   0.230    0.824
## x1           1.06925    0.78982   1.354    0.213
## I(x1^2)     -0.03162    0.04336  -0.729    0.487
##
## Residual standard error: 1.27 on 8 degrees of freedom
## Multiple R-squared:  0.6873, Adjusted R-squared:  0.6092
## F-statistic: 8.793 on 2 and 8 DF,  p-value: 0.009558
data.frame(y = anscombe$y1, x = predict(fit)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)
Plot predicted y1 against observed y1.
data.frame(y = anscombe$y2, x = predict(fit)) %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1)
When we substitute observed y1 with observed y2, it is apparent that the relation (y2, x1) is not linear.
par(mfrow = c(2, 3))
fit %>% plot(which = 1:6, cex = .6)
par(mfrow = c(1, 2))
fit %>% plot(which = c(1, 3), cex = .6)
par(mfrow = c(1, 2))
boys %$% lm(bmi ~ age) %>% plot(which = c(1, 3), cex = .6)
fit %>% plot(which = 2, cex = .6)
The QQ-plot shows some divergence from normality at the tails.
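The same check can be run directly on the residuals with qqnorm() and qqline(); the reference line makes tail divergence easier to judge:

```r
fit <- lm(y1 ~ x1, data = anscombe)  # the model from before
qqnorm(resid(fit))                   # sample vs. theoretical normal quantiles
qqline(resid(fit))                   # reference line through the quartiles
```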
Leverage: think of the fitted line as a lever; cases with extreme predictor values exert more pull on it.
Standardized residuals: residuals divided by their estimated standard deviation, putting them on a common scale.
Cook’s distance: how far the predicted values would move if your model were fit without the data point in question.
Outliers are cases with large \(e_z\) (standardized residuals).
If the model is correct, we expect the standardized residuals to lie roughly within \(\pm 2\) for about 95% of the cases.
Influential cases are cases with large influence on parameter estimates
par(mfrow = c(1, 2), cex = .6)
fit %>% plot(which = c(4, 5))
There are no cases with \(|e_z|>2\), so no outliers (right plot). There are no cases with Cook's distance \(>1\), but case 3 stands out.
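The quantities behind these diagnostic plots can also be inspected numerically, which is useful when a plot is ambiguous:

```r
fit <- lm(y1 ~ x1, data = anscombe)
rstandard(fit)       # standardized residuals e_z
hatvalues(fit)       # leverage of each case
cooks.distance(fit)  # influence of each case on the fitted values
```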
We saw that we could create vectors
c(1, 2, 3, 4, 3, 2, 1)
## [1] 1 2 3 4 3 2 1
c(1:4, 3:1)
## [1] 1 2 3 4 3 2 1
We could create matrices
matrix(1:15, nrow = 3, ncol = 5, byrow = TRUE)
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    2    3    4    5
## [2,]    6    7    8    9   10
## [3,]   11   12   13   14   15
Vectors from matrices
mat <- matrix(1:18, 3, 6)
mat
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    4    7   10   13   16
## [2,]    2    5    8   11   14   17
## [3,]    3    6    9   12   15   18
c(mat)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
We can also sample from the matrix: of the same size,
sample(mat)
## [1] 1 7 17 18 12 14 3 13 9 15 8 16 11 10 6 2 4 5
smaller size,
sample(mat, size = 5)
## [1] 16 14 3 5 15
or larger size
sample(mat, size = 50, replace = TRUE)
##  [1]  3 18 11 18  6  8  4  1 13  7  1 17  6 17  8  3 11  7 18  7 18 10  5
## [24] 11  1  6  3 16  8  3  8  4  5  1  6  8 16  5  1  5  8 16  2 16 10  2
## [47]  4 10  9 11
hist(sample(mat, size = 50000, replace=TRUE), breaks = 0:18)
probs <- c(rep(.01, 15), .1, .25, .50)
probs
##  [1] 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01
## [15] 0.01 0.10 0.25 0.50
hist(sample(mat, size = 50000, prob = probs, replace=TRUE), breaks = 0:18)
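We can verify that the sampling probabilities are respected by comparing the empirical proportions against probs:

```r
mat   <- matrix(1:18, 3, 6)
probs <- c(rep(.01, 15), .1, .25, .50)
draws <- sample(mat, size = 50000, prob = probs, replace = TRUE)
round(prop.table(table(draws)), 2)  # value 18 should appear about half the time
```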
set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10
## [24] 13  7  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7
## [47] 16 17 18 17
set.seed(123)
sample(mat, size = 50, replace = TRUE)
##  [1] 15 14  3 10 18 11  5 14  5  9  3  8  7 10  9  4 14 17 11  7 12 15 10
## [24] 13  7  9  9 10  7  6  2  5  8 12 13 18  1  6 15  9 15 16  6 11  8  7
## [47] 16 17 18 17
set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1] 15 14 3 10 18
sample(mat, size = 7, replace = TRUE)
## [1] 11 5 14 5 9 3 8
set.seed(123)
sample(mat, size = 5, replace = TRUE)
## [1] 15 14 3 10 18
sample(mat, size = 7, replace = TRUE)
## [1] 11 5 14 5 9 3 8
The random seed is a number used to initialize the pseudo-random number generator
If replication is needed, pseudorandom number generators must be used
The initial value (the seed) itself does not need to be random.
When an R instance is started, there is initially no seed. In that case, R will create one from the current time and process ID.
If we fix the random seed we can exactly replicate the random process
If the method has not changed: the results of the process will be identical when using the same seed.
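A minimal demonstration that the same seed reproduces exactly the same draws:

```r
set.seed(123)
a <- rnorm(5)
set.seed(123)    # reset the generator to the same state
b <- rnorm(5)
identical(a, b)  # TRUE: the two streams are identical
```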
We can draw data from a standard normal distribution
hist(rnorm(1000, mean = 0, sd = 1))
We can draw data from a specific normal distribution
hist(rnorm(1000, mean = 50, sd = 15))
hist(rnorm(100000, mean = 50, sd = 15))
Random sample
Population parameter
Sample statistic
The sampling distribution is the distribution of a sample statistic
Conditions for the CLT to hold
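A sampling distribution is easy to visualize by simulation: repeatedly draw a sample, store its mean, and inspect the distribution of those means. Even for a skewed parent distribution such as the \(\chi^2\), the sample means look approximately normal when \(n\) is reasonably large, as the CLT predicts:

```r
set.seed(123)
means <- replicate(1000, mean(rchisq(n = 30, df = 3)))  # 1000 sample means
hist(means)  # approximately normal, centered near the population mean of 3
```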
Statistical tests work with a null hypothesis, e.g.: \[ H_0:\mu=\mu_0 \]
\[ t_{(df)}=\frac{\bar{x}-\mu_0}{SEM} \]
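As a sketch, the statistic above can be computed by hand and checked against t.test(); the data here are simulated purely for illustration:

```r
set.seed(123)
x     <- rnorm(25, mean = 5.5, sd = 2)  # hypothetical sample
mu0   <- 5                              # hypothesized population mean
SEM   <- sd(x) / sqrt(length(x))        # standard error of the mean
t.stat <- (mean(x) - mu0) / SEM
c(by.hand = t.stat, t.test = unname(t.test(x, mu = mu0)$statistic))
```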
The \(t\)-distribution has larger variance than the normal distribution (more uncertainty).
curve(dt(x, 100), -3, 3, ylab = "density")
curve(dt(x, 2), -3, 3, ylab = "", add = TRUE, col = "red")
curve(dt(x, 1), -3, 3, ylab = "", add = TRUE, col = "blue")
legend(1.8, .4, c("t(df=100)", "t(df=2)", "t(df=1)"),
       col = c("black", "red", "blue"), lty = 1)
curve(dnorm(x), -3, 3, ylab = "density")
curve(dt(x, 2), -3, 3, ylab = "", add = TRUE, col = "red")
curve(dt(x, 1), -3, 3, ylab = "", add = TRUE, col = "blue")
legend(1.8, .4, c("normal", "t(df=2)", "t(df=1)"),
       col = c("black", "red", "blue"), lty = 1)
The \(p\)-value in this situation is the probability that \(\bar{x}\) is at least that much different from \(\mu_0\):
We would reject \(H_0\) if \(p\) is smaller than the experimenters’ (that would be you) predetermined significance level \(\alpha\):
Example of two-sided test for \(t_{(df=10)}\) given that \(P(t<-1.559)=7.5\%\) (\(\alpha\) = 0.15)
t0 <- qt(.075, 10)
cord.x1 <- c(-3, seq(-3, t0, 0.01), t0)
cord.y1 <- c(0, dt(seq(-3, t0, 0.01), 10), 0)
cord.x2 <- c(-t0, seq(-t0, 3, 0.01), 3)
cord.y2 <- c(0, dt(seq(-t0, 3, 0.01), 10), 0)
curve(dt(x, 10), xlim = c(-3, 3), ylab = "density", main = "", xlab = "t-value")
polygon(cord.x1, cord.y1, col = 'red')
polygon(cord.x2, cord.y2, col = 'red')
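The shaded tail areas together make up the two-sided rejection region; the corresponding tail probability can be computed directly with pt():

```r
t0 <- qt(.075, 10)  # left-tail critical value, about -1.559
2 * pt(t0, 10)      # both tails together: 0.15, i.e. alpha
```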
The p-value is not the probability that the null hypothesis is true, nor the probability that the alternative hypothesis is false. It quantifies neither.
The p-value is not the probability that a finding is “merely a fluke.” In fact, the calculation of the p-value is based on the assumption that every finding is the product of chance alone.
The p-value is not the probability of falsely rejecting the null hypothesis.
The p-value is not the probability that replicating the experiment would yield the same conclusion.
The significance level, \(\alpha\), is not determined by the p-value. The significance level is decided by the experimenter a-priori and compared to the p-value that is obtained a-posteriori.
The p-value does not indicate the size or importance of the observed effect: p-value, effect size, and sample size are related, so with a large enough sample even a tiny effect can yield a small p-value.
If an infinite number of samples were drawn and a CI computed for each, then the true population mean \(\mu\) would be covered by 95% of these intervals.
\[ 95\%~CI=\bar{x}\pm{t}_{(1-\alpha/2)}\cdot SEM \]
Example
x.bar  <- 7.6    # sample mean
SEM    <- 2.1    # standard error of the mean
n      <- 11     # sample size
df     <- n - 1  # degrees of freedom
alpha  <- .15    # significance level
t.crit <- qt(1 - alpha / 2, df)  # t(1 - alpha / 2) for df = 10
c(x.bar - t.crit * SEM, x.bar + t.crit * SEM)
## [1] 4.325605 10.874395
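The manual formula agrees with t.test() when given raw data; here we simulate a hypothetical sample of n = 11 just to show that the two computations produce the same 85% interval:

```r
set.seed(123)
x <- rnorm(11, mean = 7.6, sd = 7)  # hypothetical raw data
manual  <- mean(x) + c(-1, 1) * qt(.925, 10) * sd(x) / sqrt(11)
builtin <- t.test(x, conf.level = .85)$conf.int
round(rbind(manual, builtin = as.numeric(builtin)), 3)  # identical intervals
```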
Neyman, J. (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society, 97(4), 558-625.
Confidence intervals are frequently misunderstood; even well-established researchers sometimes misinterpret them.
A common misinterpretation is that there is a 95% probability that the computed interval covers the population parameter.
Once an experiment is done and an interval is calculated, the interval either covers, or does not cover the parameter value. Probability is no longer involved.
The 95% probability only has to do with the estimation procedure.
100 simulated samples from a population with \(\mu = 0\) and \(\sigma^2=1\). Out of 100 samples, only 5 samples have confidence intervals that do not cover the population mean.
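This coverage behavior is easy to reproduce by simulation (a sketch; the exact count varies with the seed and sample size chosen here):

```r
set.seed(42)
covers <- replicate(100, {
  x  <- rnorm(50)              # sample of n = 50 from N(0, 1)
  ci <- t.test(x)$conf.int     # default 95% confidence interval
  ci[1] < 0 && 0 < ci[2]       # does the interval cover mu = 0?
})
sum(covers)                    # typically around 95 of the 100 intervals
```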